[SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate #28876
Conversation
    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
    // `groupingExpressions` is not extracted during logical phase.
    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>
Suggested change:
-    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>
+    val normalizedNamedDistinctExpressions = namedDistinctExpressions.map { e =>
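For context, a minimal sketch of what the full mapping in AggUtils could look like with the rename applied. It assumes `NormalizeFloatingNumbers.normalize` as the entry point and an `Alias` fallback for results that are no longer named; the wrapper name `normalizeDistinct` is hypothetical and the exact merged code may differ.

    import org.apache.spark.sql.catalyst.expressions.{Alias, NamedExpression}
    import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers

    // Sketch: normalize each named distinct expression so NaN/-0.0 are canonicalized
    // before the distinct aggregate. If normalization wraps the expression so it is
    // no longer a NamedExpression, re-alias it under the original name and ExprId so
    // downstream operators can still bind to the same output attribute.
    def normalizeDistinct(namedDistinctExpressions: Seq[NamedExpression]): Seq[NamedExpression] =
      namedDistinctExpressions.map { e =>
        NormalizeFloatingNumbers.normalize(e) match {
          case ne: NamedExpression => ne
          case other => Alias(other, e.name)(exprId = e.exprId)
        }
      }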
I have a basic Catalyst question, so feel free to send me away. The question is why being "named" is a requirement in this case. I bet it has to do with expression binding, but I am not entirely sure, and was wondering if you had that answer since you had to special-case it here.
If you are asking why we need named expressions here, I think it is because we need these distinct expressions to be in the result expressions of the Aggregate physical operator. These result expressions are for the output attributes.
Thanks for taking the time @viirya. I am not 100% sure of all the cases that need named expressions, but it makes sense to me that the physical node's output expressions need to be named. It seems like any downstream node that needs to refer to an output needs things like ExprId in order to distinguish fields.
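A quick illustration of that point: in Catalyst, two NamedExpressions can share a name but are told apart by their ExprId, which is what downstream plans bind against. A minimal spark-shell style sketch (the aliases and literal values are arbitrary, for illustration only):

    import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}

    // Two aliases with the same user-visible name...
    val a1 = Alias(Literal(1), "score")()
    val a2 = Alias(Literal(2), "score")()

    a1.name == a2.name      // true: same name
    a1.exprId == a2.exprId  // false: distinct ExprIds keep them distinguishable
    a1.toAttribute          // the output attribute a downstream plan refers to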
@viirya thanks for looking at this issue.

Thank you for pinging me, @viirya.

cc @gatorsmile too since this is a correctness issue in 3.0.0. I believe we need to include this in 3.0.1.

@abellina @dongjoon-hyun Thanks for the comments.
Looks right to me.
abellina left a comment:
LGTM
retest this please
dongjoon-hyun left a comment:
+1, LGTM. Thank you so much, @viirya.
Test build #124354 has finished for PR 28876 at commit
…nct aggregate

### What changes were proposed in this pull request?

This patch applies `NormalizeFloatingNumbers` to distinct aggregate to fix a regression of distinct aggregate on NaNs.

### Why are the changes needed?

We added `NormalizeFloatingNumbers` optimization rule in 3.0.0 to normalize special floating numbers (NaN and -0.0). But it is missing in distinct aggregate so causes a regression. We need to apply this rule on distinct aggregate to fix it.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a regression of distinct aggregate on NaNs.

### How was this patch tested?

Added unit test.

Closes #28876 from viirya/SPARK-32038.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 2e4557f)
Signed-off-by: Dongjoon Hyun <[email protected]>
Thank you all. Merged to master/3.0.

late LGTM, thanks, @viirya
SPARK-32038 reports a regression in Apache Spark 3.0.0: NaN and -0.0 float values are not normalized during DISTINCT aggregations. This causes a mismatch in results between Apache Spark 3.0.0 on CPU and the Rapids Accelerator (which returns the right results). SPARK-32038 was fixed in apache/spark#28876. This commit introduces a conditional xfail test that passes on Apache Spark 3.0.1 and 3.1+ (which fix SPARK-32038), but produces an expected failure on Spark 3.0.0.
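For context, a minimal spark-shell sketch of the kind of query affected; the view, column names, and values are illustrative. Because two NaN encodings with different bit patterns are not normalized before the distinct aggregate, Spark 3.0.0 can report two distinct scores here instead of the expected one, while 3.0.1/3.1+ return one.

    import java.lang.Float.intBitsToFloat

    // Two different bit patterns that both represent NaN.
    val nan1 = intBitsToFloat(0x7fc00000)
    val nan2 = intBitsToFloat(0x7fd00000)

    Seq(("mithunr", nan1), ("mithunr", nan2))
      .toDF("uid", "score")
      .createOrReplaceTempView("view")

    // Expected: one distinct score per uid, since all NaNs should compare equal.
    spark.sql("SELECT uid, COUNT(DISTINCT score) AS cnt FROM view GROUP BY uid").show()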
What changes were proposed in this pull request?
This patch applies `NormalizeFloatingNumbers` to distinct aggregate to fix a regression of distinct aggregate on NaNs.

Why are the changes needed?
We added the `NormalizeFloatingNumbers` optimization rule in 3.0.0 to normalize special floating numbers (NaN and -0.0). But it is missing in distinct aggregate, which causes a regression. We need to apply this rule on distinct aggregate to fix it.

Does this PR introduce any user-facing change?
Yes, fixing a regression of distinct aggregate on NaNs.
How was this patch tested?
Added unit test.
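A sketch of what such a test could look like in DataFrameAggregateSuite; the exact test added by the PR may differ, and the names and values below are illustrative.

    test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
      withTempView("view") {
        Seq(
          ("mithunr", Float.NaN),
          ("mithunr", java.lang.Float.intBitsToFloat(0x7fc00000)), // another NaN bit pattern
          ("mithunr", java.lang.Float.intBitsToFloat(0x7fd00000)), // yet another NaN bit pattern
          ("abellina", 1.0f),
          ("abellina", 2.0f))
          .toDF("uid", "score").createOrReplaceTempView("view")

        // All NaN encodings must collapse to a single distinct value.
        val df = spark.sql(
          "SELECT uid, COUNT(DISTINCT score) AS cnt FROM view GROUP BY uid ORDER BY uid")
        checkAnswer(df, Row("abellina", 2) :: Row("mithunr", 1) :: Nil)
      }
    }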